Add Anima modular pipeline #13732

Open
rmatif wants to merge 6 commits into huggingface:main from rmatif:anima

Conversation

@rmatif rmatif commented May 12, 2026

What does this PR do?

Adds modular-only support for Anima, a text-to-image model built on top of the Cosmos Predict2 DiT architecture.

This PR adds:

  • AnimaModularPipeline and AnimaAutoBlocks
  • AnimaTextConditioner
  • checkpoint conversion for the original Anima weights
  • LoRA loading support for the Cosmos transformer and Anima text conditioner
  • docs and fast modular pipeline tests

Converted weights:
https://huggingface.co/mrfatso/anima-preview3-diffusers

Fixes #13067

cc @tdrussell

Testing

uv run pytest tests/modular_pipelines/anima/test_modular_pipeline_anima.py -q
uv run ruff check src/diffusers/modular_pipelines/anima src/diffusers/models/transformers/transformer_anima.py scripts/convert_anima_to_diffusers.py tests/modular_pipelines/anima/test_modular_pipeline_anima.py
uv run python utils/check_dummies.py
uv run python utils/modular_auto_docstring.py src/diffusers/modular_pipelines/anima

Tested the converted checkpoint locally with txt2img generation and LoRA loading

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline?
  • Did you read our philosophy doc (important for complex PRs)?
  • Was this discussed/approved via a GitHub issue or the forum? [Feature] Add support for Anima #13067
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

@yiyixuxu @asomoza

@github-actions github-actions Bot added size/L PR with diff > 200 LOC documentation Improvements or additions to documentation fixes-issue lora tests pipelines loaders and removed size/L PR with diff > 200 LOC labels May 12, 2026
@tdrussell

I'm the creator of Anima. I've run this and confirmed it looks good in terms of correct model implementation and producing images that match the quality of other implementations. I'll let the Diffusers team comment on any Diffusers-specific questions or concerns.

Member

asomoza commented May 12, 2026

@rmatif thanks a lot for this, can you please post an example code and a couple of generations with this?

Author

rmatif commented May 12, 2026

can you please post an example code and a couple of generations with this?

Sure, here is a small txt2img example using the converted checkpoint:

import torch
from diffusers import AnimaPipeline

pipe = AnimaPipeline.from_pretrained(
    "mrfatso/anima-preview3-diffusers",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

prompt = (
    "masterpiece, best quality, very aesthetic, absurdres, 1girl, solo, silver hair, blue eyes, "
    "long hair, school uniform, sailor collar, cherry blossoms, petals, spring, soft lighting, "
    "looking at viewer, upper body, detailed background"
)
negative_prompt = (
    "worst quality, low quality, score_1, score_2, score_3, blurry, jpeg artifacts, "
    "sepia, signature, artist name"
)

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=1024,
    height=1024,
    num_inference_steps=25,
    guidance_scale=4.0,
    generator=torch.Generator(device="cuda").manual_seed(12341),
).images[0]

image.save("anima.png")
(Example generations attached: anima_example_1_seed_12341, anima_example_2_seed_12342.)

@sayakpaul
Member

@claude can you do a review?

@github-actions
Contributor

github-actions Bot commented May 13, 2026

Claude finished @sayakpaul's task in 4m 49s


Review of PR #13732: Add Anima pipeline

  • Read repository guidelines (AGENTS.md, models.md, pipelines.md, modular.md)
  • Review modeling_anima.py (AnimaTextConditioner)
  • Review pipeline_anima.py (AnimaPipeline)
  • Review __init__.py changes
  • Review LoRA integration (lora_pipeline.py, lora_conversion_utils.py)
  • Review tests
  • Review docs
  • Trace call path & check for dead code
  • Post final review

Overall this is a clean PR that reuses existing infra well (Cosmos transformer, QwenImage VAE, FlowMatchEuler scheduler). The code is readable and the LoRA conversion logic is well-structured. A few issues to fix before merge:


Blocking issues

1. _toctree.yml — broken YAML structure (docs won't build)

docs/source/en/_toctree.yml lines 502-506 use tab characters for indentation (the rest of the file uses spaces), and the Anima + AnimateDiff entries are structurally nested under the "Audio" section title (line 501). Anima is an image pipeline and should be a sibling of AuraFlow, Bria, etc., not a child of Audio.

The insertion also appears to have displaced the AnimateDiff entry from its original position.

Fix: insert the Anima entry at the same indentation level as the other image pipelines (e.g. between the Audio section and AuraFlow), using spaces, not tabs. AnimateDiff should remain in its original position.

2. Missing dummy objects

AnimaPipeline and AnimaTextConditioner need entries in src/diffusers/utils/dummy_torch_and_transformers_objects.py. Without them, from diffusers import AnimaPipeline will fail with a confusing ImportError when torch or transformers isn't installed, instead of the friendly "requires torch and transformers" message. Every other pipeline has these — see AnimateDiffPipeline at line 845 for the pattern.
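For readers unfamiliar with the mechanism, here is a minimal self-contained sketch of the dummy-object pattern (the real implementation lives in `diffusers.utils` and differs in detail; this toy `DummyObject` is a simplified stand-in):

```python
# Toy re-implementation of the dummy-object pattern: the class imports fine,
# but instantiating it raises a clear error naming the missing backends.
class DummyObject(type):
    def __call__(cls, *args, **kwargs):
        raise ImportError(
            f"{cls.__name__} requires the following backends: "
            + ", ".join(cls._backends)
        )


class AnimaPipeline(metaclass=DummyObject):
    _backends = ["torch", "transformers"]


class AnimaTextConditioner(metaclass=DummyObject):
    _backends = ["torch", "transformers"]
```

With entries like these in place, `from diffusers import AnimaPipeline` succeeds even when torch/transformers are absent, and the failure is deferred to instantiation with an actionable message.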

3. AnimaTextConditioner import ordering in src/diffusers/__init__.py

In the _import_structure dict (line 505-506), AnimaPipeline and AnimaTextConditioner are adjacent. But in the TYPE_CHECKING block, AnimaTextConditioner is at line 1314 (after all the AnimateDiff entries), separated from AnimaPipeline at line 1306. Should be right after AnimaPipeline to keep alphabetical consistency with the lazy import structure. Run make style to fix.


Non-blocking suggestions

4. F.scaled_dot_product_attention in AnimaTextConditionerAttention

modeling_anima.py:114 calls F.scaled_dot_product_attention directly. The repo convention for ModelMixin subclasses is to use dispatch_attention_fn with the AttentionModuleMixin / processor pattern (see .ai/models.md). AnimaTextConditioner inherits ModelMixin + PeftAdapterMixin and acts like a model. Using F.scaled_dot_product_attention directly means it won't benefit from attention backend dispatch (flash, sage, etc.).

This isn't blocking since the text conditioner is small relative to the transformer and runs once per generation (not in the denoising loop), but it diverges from repo convention. Consider adopting the dispatch_attention_fn pattern in a follow-up.
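The shape of the recommended pattern can be sketched without torch; note this registry is a toy illustration only, not the actual `dispatch_attention_fn` API:

```python
# Toy re-implementation of the backend-dispatch idea: model code routes
# attention through a registry, so backends (sdpa, flash, sage, ...) can be
# swapped via config without touching the model's forward().
_ATTENTION_BACKENDS = {}


def register_backend(name):
    def decorator(fn):
        _ATTENTION_BACKENDS[name] = fn
        return fn
    return decorator


def dispatch_attention(query, key, value, backend="sdpa"):
    try:
        attention_fn = _ATTENTION_BACKENDS[backend]
    except KeyError:
        raise ValueError(f"unknown attention backend: {backend!r}") from None
    return attention_fn(query, key, value)


@register_backend("sdpa")
def _sdpa_attention(query, key, value):
    # stand-in for F.scaled_dot_product_attention
    return ("sdpa", query, key, value)


@register_backend("flash")
def _flash_attention(query, key, value):
    # stand-in for a flash-attention kernel
    return ("flash", query, key, value)
```

Calling `F.scaled_dot_product_attention` directly pins the module to one backend; routing through a dispatcher is what lets the processor pattern swap kernels at runtime.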

5. Unused target_token_weights parameter

AnimaTextConditioner.forward() declares target_token_weights (line 251) and uses it in the body (lines 287-288), but the pipeline never passes this argument. From the pipeline's call path, it's dead code. Per AGENTS.md: "do not carry unused method parameters 'for API consistency'". If this is intended for LoRA or external callers, document it; otherwise, remove it.

6. text_embed_dim in the test conditioner doesn't match target_dim

In test_anima.py:77, CosmosTransformer3DModel is created with text_embed_dim=16. The AnimaTextConditioner at line 97 has target_dim=16, which means prompt embeddings are 16-dim. However, the conditioner's out_proj maps model_dim → target_dim (16→16), and the output is what gets passed to the transformer as encoder_hidden_states. The pipeline feeds the output of text_conditioner (dim=target_dim) into the transformer which expects text_embed_dim. These match in the test (both 16), but the naming convention is worth documenting to avoid confusion — target_dim in the conditioner must equal text_embed_dim in the transformer config.
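A lightweight guard would make that implicit contract explicit. The helper below is hypothetical (not part of the PR); the config keys mirror the names discussed above:

```python
def check_text_dims(conditioner_config: dict, transformer_config: dict) -> None:
    """Hypothetical consistency check: the conditioner's output width must
    match the width the transformer expects for encoder_hidden_states."""
    target_dim = conditioner_config["target_dim"]
    text_embed_dim = transformer_config["text_embed_dim"]
    if target_dim != text_embed_dim:
        raise ValueError(
            f"AnimaTextConditioner target_dim ({target_dim}) must equal "
            f"CosmosTransformer3DModel text_embed_dim ({text_embed_dim})"
        )


# matches the fast-test config described above (both 16)
check_text_dims({"target_dim": 16}, {"text_embed_dim": 16})
```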

7. Consider # Copied from annotation for retrieve_timesteps

pipeline_anima.py:58 has # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.retrieve_timesteps — good. Just making sure this passes make fix-copies.

8. Denoising loop differs significantly from Cosmos2

The Anima denoising loop (lines 498-544) doesn't use c_in/c_skip/c_out scaling, unlike the Cosmos2 pipeline which computes current_t = current_sigma / (current_sigma + 1) and applies velocity scaling. Since the model creator confirmed parity, this is presumably intentional (Anima was fine-tuned with a different training formulation). Just noting this for maintainer awareness — a brief comment in the pipeline noting the difference from standard Cosmos2 denoising would help future readers.
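To make the quoted difference concrete, only the σ→t mapping cited above is reproduced here; the exact `c_in`/`c_skip`/`c_out` formulas are deliberately omitted and should be read from the Cosmos2 pipeline source rather than guessed:

```python
def cosmos2_timestep(sigma: float) -> float:
    """Timestep reparameterization quoted in the review:
    current_t = current_sigma / (current_sigma + 1)."""
    return sigma / (sigma + 1.0)


# the mapping squashes sigma in [0, inf) into t in [0, 1)
assert cosmos2_timestep(0.0) == 0.0
assert cosmos2_timestep(1.0) == 0.5
```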


Suggestions / additional info: Dead code trace

I traced the call path from AnimaPipeline.__call__ through the CosmosTransformer3DModel.forward():

Parameters passed by the pipeline:

  • hidden_states
  • timestep
  • encoder_hidden_states
  • padding_mask
  • return_dict=False

Parameters NOT passed (using defaults):

  • block_controlnet_hidden_states=None — controlnet residuals, not applicable for base txt2img
  • attention_mask=None — text attention mask is not passed through. The text conditioner already zero-masks padded positions via target_attention_mask, so the output embeddings have zeros for padding, but the transformer's cross-attention still attends to those zero-padded positions. This works because the embeddings are zeroed out, not because attention is masked. Fine in practice.
  • fps=None — video-only parameter, correct for image generation
  • condition_mask=None — used for video conditioning, correct for image generation
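The "zeroed embeddings" point can be seen in a toy NumPy cross-attention (single head, arbitrary shapes and values): the padded positions do receive softmax mass, but because their value vectors are zero they add nothing to the output beyond renormalization.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


# toy cross-attention: 1 query, 4 keys/values, the last 2 are zeroed "padding"
rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
kv_real = rng.normal(size=(2, 8))
k = np.vstack([kv_real, np.zeros((2, 8))])
v = np.vstack([kv_real, np.zeros((2, 8))])

w = softmax(q @ k.T / np.sqrt(8))
out = w @ v

# padded positions receive attention mass...
assert w[0, 2] > 0 and w[0, 3] > 0
# ...but contribute nothing to the output beyond renormalization:
assert np.allclose(out[0], w[0, :2] @ kv_real)
```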

Under default configs (concat_padding_mask=True, extra_pos_embed_type=None in the test config):

  • self.learnable_pos_embed is None when extra_pos_embed_type=None, so the extra_pos_emb path is dead in this config
  • use_crossattn_projection=False (default) → crossattn_proj is never instantiated or called
  • img_context_dim_in=None (default) → img_context_proj never instantiated

These are all expected to be dead under the Anima config — they're Cosmos-specific features that Anima doesn't use. No issues here.


@yiyixuxu
Collaborator

@rmatif

thanks for the PR! super excited to have Anima in diffusers

At a high level, I have two pieces of feedback/questions:

(1) Can we explore having an AnimaTransformer3DModel instead of a standalone AnimaTextConditioner? Is there any reason you went with this design? I think it'd be more consistent with our conventions to have a dedicated model that (1) has shared layers copied from Cosmos with # Copied from statements and (2) includes this additional conditioning layer as a submodule. Additionally, it'd be closer to the original checkpoint format, which will make future support easier.

(2) Can we support this pipeline from Modular Diffusers directly? Given the very active community ecosystem and the continuous training/release style, Modular is a better fit — see the docs here: https://huggingface.co/docs/diffusers/main/en/modular_diffusers/overview. Since you've already implemented it as a standard pipeline, it would take a little refactoring. Happy to provide more info if interested; we have pretty good docs for AI agents on this and I can point you to reference PRs as well.

@tdrussell

Regarding (1), what if we just subclassed the Cosmos DiT, like ComfyUI does?

The main reason to try to avoid duplicating code is that Anima's DiT architecture is identical to the Cosmos-Predict2 DiT. The only change is the LLM Adapter module (called AnimaTextConditioner in this PR). In ComfyUI the adapter lives as a submodule of the DiT for convenience, but it's not called in the forward() method since it only needs to run once for the entire diffusion process. So regardless of the structure, the pipeline code is going to be calling the adapter "manually" only once.

@yiyixuxu
Collaborator

@tdrussell

what if we just subclassed the Cosmos DiT?

This isn't something we do in diffusers — all our models are self-contained and inherit from ModelMixin directly. We try to keep the code structure flat and easy to read

In ComfyUI the adapter lives as a submodule of the DiT for convenience, but it's not called in the forward() method since it only needs to run once for the entire diffusion process.

ohh, we usually include text condition layers in forward as well for simplicity — the performance tradeoff is typically non-significant. But if that's not the case for Anima, keeping it as a separate component, as this PR does, makes sense.

@github-actions github-actions Bot added size/L PR with diff > 200 LOC utils labels May 13, 2026
@rmatif
Author

rmatif commented May 13, 2026

@yiyixuxu My preference is also to keep AnimaTextConditioner as a separate component in this PR

The main reason is that Anima’s DiT is not a new architecture. The denoiser weights and forward path are the Cosmos Predict2 DiT, the Anima-specific part is the LLM adapter that turns Qwen3 hidden states + T5 token ids into the encoder_hidden_states consumed by Cosmos

Since subclassing CosmosTransformer3DModel is not a Diffusers pattern, an AnimaTransformer3DModel would probably mean copying the Cosmos transformer into a new self-contained class just to add one adapter submodule. That feels worse to me for many reasons

The checkpoint conversion does split net.llm_adapter.* into text_conditioner/, but it is still strict and direct, there is no architectural remapping beyond separating the adapter from the unchanged Cosmos DiT

If the preference is still to make an AnimaTransformer3DModel, I can do that, but I think it would mostly be a wrapper/copy around Cosmos rather than a meaningfully different model class

And I agree that Modular Diffusers is a good fit for Anima. Would it be okay to handle Modular support in a follow-up PR?

@tdrussell

@yiyixuxu

ohh, we usually include text condition layers in forward as well for simplicity — the performance tradeoff is typically non-significant. But if that's not the case for Anima

It's not the case for Anima. The LLM Adapter is 6 transformer layers with both self- and cross-attention, which is heavier than what is typical in most models (often just a single MLP projection layer). Anima basically has a mini text encoder that is converting from Qwen3 embedding space to T5XXL embedding space for input to the model. It's been a while since I ran the numbers, and I didn't write it down, but I recall the LLM Adapter as being ~10% of the full forward pass. IMO this is enough to warrant being called just once for the entire diffusion loop (and is what ComfyUI does as well).
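Structurally, this argument boils down to hoisting the adapter call out of the denoising loop. The sketch below is illustrative only — all names and signatures are hypothetical stand-ins, not the PR's actual API:

```python
def generate(text_conditioner, transformer, scheduler, latents,
             qwen_hidden_states, t5_token_ids):
    # one-time cost (~10% of a forward pass, per the comment above): map
    # Qwen3 hidden states + T5 token ids to the embeddings the DiT consumes
    encoder_hidden_states = text_conditioner(qwen_hidden_states, t5_token_ids)

    # the denoising loop reuses the cached embeddings on every step
    for t in scheduler.timesteps:
        noise_pred = transformer(
            latents, timestep=t, encoder_hidden_states=encoder_hidden_states
        )
        latents = scheduler.step(noise_pred, t, latents)
    return latents
```

If the adapter lived inside the transformer's `forward()`, its cost would be paid on every denoising step instead of once per generation.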

@yiyixuxu
Collaborator

Sounds good to keep AnimaTextConditioner then!

Can we support Anima only through Modular Diffusers, rather than maintaining both? We've been supporting new pipelines through both, but now that Modular is officially released we're looking to shift new pipelines to modular-only. Especially since we expect Anima to be a very actively developed model, both from the author and the community, the maintenance cost from the standard pipeline could be quite high for us.

@rmatif rmatif changed the title Add Anima pipeline Add Anima modular pipeline May 13, 2026
@rmatif
Author

rmatif commented May 13, 2026

Can we support Anima only through Modular Diffusers, rather than maintaining both?

Fair enough, I moved everything into Modular. Looking forward to your review

Here’s the updated example:

import torch
from diffusers import AnimaAutoBlocks
from diffusers.guiders import ClassifierFreeGuidance

pipe = AnimaAutoBlocks().init_pipeline("mrfatso/anima-preview3-diffusers")
pipe.load_components(torch_dtype=torch.bfloat16)
pipe.update_components(guider=ClassifierFreeGuidance(guidance_scale=4.0))
pipe.to("cuda")

prompt = (
    "masterpiece, best quality, very aesthetic, absurdres, 1girl, solo, silver hair, blue eyes, "
    "long hair, school uniform, sailor collar, cherry blossoms, petals, spring, soft lighting, "
    "looking at viewer, upper body, detailed background"
)
negative_prompt = (
    "worst quality, low quality, score_1, score_2, score_3, blurry, jpeg artifacts, "
    "sepia, signature, artist name"
)

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=1024,
    height=1024,
    num_inference_steps=25,
    generator=torch.Generator(device="cuda").manual_seed(12341),
).images[0]

image.save("anima.png")

Collaborator

@yiyixuxu yiyixuxu left a comment


i left one comment,
overall looks good to me, thanks for working on this


class AnimaTextConditionerBlock(nn.Module):
Collaborator


ohhh but it is not a transformer. I think we have a couple of options:

  1. Create a new folder under models/ for non-standard pipeline components.
  2. Follow the same convention as in standard pipelines, host it under modular_pipelines/anima/text_conditioner.py. it requires a small change in modular from_pretrained() to work since the model is pipeline local and won't be importable on top-level

want to hear everyone's thoughts!

I think maybe it's time for (1) because it is just strange that we host model components under pipeline folders. the pipeline-local model structure was designed at the time we use same UNet and vae for every pipeline. A lot has changed since — all our models now follow the single-file pattern and pretty much every model is pipeline-specific. maybe we don't have to keep that distinction anymore

Collaborator


cc @DN6 here

Comment thread docs/source/en/api/pipelines/anima.md Outdated
@yiyixuxu yiyixuxu requested a review from sayakpaul May 14, 2026 03:25